Abstract



Distributed Hash Tables offer a resilient lookup 
service for unstable distributed environments. 
Resilient data storage, however, requires addi- 
tional data replication and maintenance algo- 
rithms. These algorithms can have an impact on 
both the performance and the scalability of the sys- 
tem. In this paper, we describe the goals and design 
space of these replication algorithms. 

We examine an existing replication algorithm 
and present a new analysis of its reliability. We 
then present a new dynamic replication algorithm 
which can operate in unstable environments. We 
give several possible replica placement strategies for 
this algorithm, and show how they impact reliabil- 
ity and performance. 

Finally we compare all replication algorithms 
through simulation, showing quantitatively the dif- 
ference between their bandwidth use, fault toler- 
ance and performance. 



1 Introduction 

Grid Computing and Content Distribution appli- 
cations place demanding scalability and fault toler- 
ance requirements on underlying services. Client- 
Server based solutions have often encountered 
problems meeting these requirements [1, 7]. 

Peer to peer technology has demonstrated excel- 
lent scalability in file sharing applications [17]. It 
seems likely this potential will be invaluable in solv- 
ing scalability problems for more legitimate grid 
service applications. 

Much recent research in the area of peer to peer 
computing has focused around various kinds of dis- 
tributed hash table (DHT). These algorithms pro- 
vide scalable fault tolerant key based routing. Key 
lookups can be routed reliably to the host responsible
for them, which returns any associated data.

In this paper we explore the potential of Dis- 
tributed Hash Tables to provide reliable, scalable 
and consistent storage of mutable data. We show 
how the choice of replication algorithm can affect 
the performance, reliability and bandwidth costs 
of storing data in a Hash Table, a topic that has 
previously received little attention. 

We will first give a brief overview of distributed 
hash tables, and describe the aims and design space 
of replication algorithms. 

We then describe both an existing replication al- 
gorithm and a new replication algorithm based on 
dynamic replication[18], but adapted to provide re- 
liability in an unstable environment, investigating 
the performance and reliability of both. 

We then provide an analysis of the impact of 
various replica placement strategies possible with 
our dynamic replication algorithm. We show that 
replica placement can have a significant impact on 
the reliability and performance of the system. 

Finally, we use a simulation to give a quantita- 
tive comparison of the performance and bandwidth 
costs of the algorithms we have described. 

2 Distributed Hash Tables 

Distributed hash tables provide a solution to the 
lookup problem in distributed systems. Given the 
name of a data item stored in the system, we can 
locate the node on which that data item should 
be stored. Most DHTs aim to find the responsible 
node with delay logarithmic in the number of nodes 
in the network. 

There are many different DHT implementa- 
tions, including PAST[16], Tapestry[22], CAN[14], 
Kademlia[9] and Chord/DHash [11]. A node in a 
DHT is typically responsible for the data close to 
it according to some distance function. Each node 
maintains knowledge about a small proportion of
other nodes in the network, and uses these to forward
requests for keys it does not own to nodes
which are closer to the requested key. Usually, the
distance is reduced by a constant fraction at each 
routing hop, leading to lookup time logarithmic in 
the number of nodes. 

In any large scale distributed system, nodes are 
likely to be joining and leaving the system con- 
stantly. These changes in the set of participating 
nodes are called churn. DHT routing is often tol- 
erant of node churn, but storage is not. When 
a node fails, the key-value pairs it was responsi- 
ble for become inaccessible, and must be recovered 
from elsewhere. This means that to provide relia- 
bility a replication algorithm must store and main- 
tain backup copies of data. It must do this in a 
manner scalable both in the number of nodes and 
the quantity of data stored in the DHT. 

In this paper, we will discuss replication in the 
context of Chord. Briefly, Chord nodes and data 
items are given IDs between 0 and some maximum
K, which map to positions on a ring. The dis- 
tance function between IDs is the clockwise dis- 
tance around the ring between them. A node owns, 
or is responsible for, data that it is the first clock- 
wise successor of. In order to route key requests, 
each node maintains knowledge of its immediate 
clockwise neighbors, called its successors, and sev- 
eral other nodes at fractional distances around the 
ring from it, called its fingers. Space does not permit
a full description of Chord here; readers are encouraged
to consult the original paper [11].
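
The ownership rule can be illustrated with a small sketch. This is not Chord itself: the ring size K, the helper names clockwise_distance and owner, and the example node IDs are all assumptions made for this illustration, and the whole ring is assumed to be visible in one list.

```python
from bisect import bisect_left

K = 2 ** 16  # assumed size of the identifier space


def clockwise_distance(a, b):
    """Clockwise distance from ID a to ID b on a ring of K positions."""
    return (b - a) % K


def owner(node_ids, key):
    """Return the first node at or clockwise after key (its successor)."""
    ring = sorted(node_ids)
    i = bisect_left(ring, key % K)
    return ring[i % len(ring)]  # wrap past the largest ID to the smallest


nodes = [100, 9000, 30000, 61000]  # hypothetical node IDs
print(owner(nodes, 9000))   # a key equal to a node ID is owned by that node
print(owner(nodes, 62000))  # wraps around the ring to the smallest node ID
```

A real Chord node only knows its successors and fingers, so it would reach the same answer through several routing hops rather than a single sorted-list search.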

3 The Aims of Replication 

A Replication algorithm aims to achieve some com- 
bination of the following design goals. 

Reliability The replication algorithm must not 
rely on any single node, and must recover from 
churn without user intervention. 

Scalability The replication algorithm should scale
to storing large quantities of data on a large
number of nodes, N. So as not to limit the
scalability of Chord, per node replication algorithm
state and bandwidth usage should be
O(log(N)).



Lookup latency The replication algorithm may 
reduce the time taken to look up information 
by placing replicas of data in a manner that 
allows network locality to be exploited. 

Mutability Updating data involves enumerating 
all replicas. Once this is done, a distributed 
commit protocol can be used to update the 
data consistently at all locations. For this rea- 
son, enumerating all replicas should be as fast 
as possible. 

Load balancing The replication algorithm may 
provide caches of more popular items in or- 
der that the load is evenly balanced among all 
the nodes in the network. 

Consistency If the replication algorithm is to 
work with mutable data, it should seek to pro- 
vide clear guarantees about the consistency of 

replicas. 

Different algorithms achieve these aims to vary- 
ing extents. The choice of replication strategy may 
depend on which goals are more important to the 
task being considered. 

Other aims may include resilience to malicious 
nodes, anonymity or privacy. These are important 
goals, but we consider them orthogonal to the repli- 
cation problem, and best treated separately. 

4 Replication Algorithms 

A replication algorithm can be characterized by
its approach to four key problems:

Replica Maintenance Node churn will cause 
replicas to be lost. The replication algorithm 
must detect and repair these lost replicas with- 
out using excessive bandwidth. 

Replica Addressability In order to update an 
item, we need to locate all replicas of that item. 
Ways of doing this include limiting replica 
placement to a fixed number of nodes, search- 
ing for replicas, and periodically deleting old 
replicas. 

Replica Placement The replica placement strat- 
egy determines which nodes replicas should be 
stored on. This can have an impact on both 
performance and reliability. 
Replica Cardinality The number of replicas we
keep of a given key may either be fixed in advance,
or allowed to vary according to the key's
popularity. Variable cardinality often provides 
superior load balancing, but makes address- 
ability more difficult to achieve. 

In this paper, we consider replication by storing a
complete copy of the data associated with each key
on another node. We believe that the algorithms we
will describe could, with some adaptation, also be
applied to erasure coded fragments of the original
data, with possible performance benefits in some
circumstances [15].

We will now give details of how two replication 
algorithms approach these tasks. 

5 DHash Replication 

This replication algorithm is a combination of the 
replica placement strategy first proposed in the 
original Chord paper [11], and a maintenance algorithm
proposed in [4]. The techniques are combined
in the DHash [3] storage system.

The placement strategy is simple: replicas of a
data item are placed on the r successors of the node
responsible for that data item and nowhere else.

Newly joined nodes will inherit a keyspace from 
the node they precede, and are sent the data they 
become responsible for when the maintenance al- 
gorithm next runs. 

The DHash maintenance algorithm runs two 
maintenance protocols in order to prevent the num- 
ber of replicas from either dropping too low or ris- 
ing too high: 

Local Maintenance A node sends a SYNCHRO- 
NIZE message containing the key range it 
is responsible for to its r successors. These 
nodes then synchronize the contents of their 
databases so that all keys in this range are 
stored on both the root node and its succes- 
sors. 

Efficient methods for database synchronization,
such as Merkle Tree hashing [10], are discussed
in [4].

Global Maintenance A node periodically checks
its database of keys to see if it has any keys it
is no longer responsible for. To do this, it looks
up the owner of each key it stores, and checks
the successor list of that owner.

If the node is within r hops of the owner, its ID
will be within the first r items of that successor list.

If its ID is not in this list, the node is no longer
responsible for keeping a replica of this item,
and the node offers the data item to the current
owner. It can then safely delete the key.

If all replicas are to be repaired by a single maintenance
call, the local maintenance algorithm must
run two passes, the first gathering all replicas in the
root node's key range onto the root node, the second
distributing these replicas onto all successors.
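
The global maintenance test for a single key can be sketched as follows. This is a minimal illustration, not the DHash implementation: it assumes the whole ring is visible as a sorted list of node IDs, and the helper names successors and should_keep are hypothetical.

```python
from bisect import bisect_left


def successors(ring, start_index, count):
    """The `count` nodes clockwise after ring[start_index]."""
    n = len(ring)
    return [ring[(start_index + i) % n] for i in range(1, min(count, n - 1) + 1)]


def should_keep(ring, node_id, key, r):
    """True if node_id is the key's owner or one of the owner's r successors."""
    i = bisect_left(ring, key) % len(ring)  # index of the key's owner
    return node_id == ring[i] or node_id in successors(ring, i, r)


ring = [10, 20, 30, 40, 50, 60]  # sorted node IDs (illustrative)
print(should_keep(ring, 40, 15, r=2))  # owner is 20, successors are 30 and 40
print(should_keep(ring, 50, 15, r=2))  # outside the replica set: offer, then delete
```

In a deployed system the owner and its successor list would be obtained by a Chord lookup rather than by indexing a global list.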

5.1 Fetch Algorithm 

If we adopt this replication algorithm, we can use 
the following fetch algorithm to retrieve the data 
associated with a key. This algorithm will also 
share the load of providing a data item between 
all the replica holders. 

Algorithm 1 Fetch for key under Successor Replication

    successors ← findSuccessors(key)
    item ← NULL
    while ¬item and ¬successors.isEmpty() do
        node ← successors.popRandom()
        item ← node.get(key)
    end while
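
As an illustration, the fetch loop can be written as a short Python function. This is a sketch under stated assumptions: replica holders are modelled as local dictionaries rather than remote nodes, and the function name fetch and its rng parameter are hypothetical.

```python
import random


def fetch(key, successors, rng=random):
    """Probe replica holders in random order until one returns the item.

    `successors` is a list of dict-like stores standing in for remote nodes.
    """
    candidates = list(successors)
    item = None
    while item is None and candidates:
        node = candidates.pop(rng.randrange(len(candidates)))
        item = node.get(key)
    return item


# Three replica holders; one is newly joined and still empty.
holders = [{"k": "value"}, {}, {"k": "value"}]
print(fetch("k", holders))        # found on one of the populated holders
print(fetch("missing", holders))  # None once every holder has been probed
```

Choosing the probe order at random is what spreads the load of serving a popular item across all replica holders.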



5.2 Maintenance and Reliability 

In order to keep the system reliable, we must both 
store replicas and repair them regularly. The more 
replicas we keep, the less frequently maintenance is 
required. This means that, for a given level of reli- 
ability, there is a trade-off between the bandwidth 
used by maintenance algorithm runs and the disk 
space used for storing replicas. 

Here, we attempt to give some new insight into 
this trade-off, and show the level of replication and 
maintenance necessary to provide a given level of 
reliability. 

For a given data item to be lost, all r of the
nodes holding replicas of it must fail. If a replica
is missing from the system with probability p, the
probability of a given data item being lost is simply
$p^r$.
For the purposes of providing reliable storage,
however, we are concerned not with the probability 
of a given item being lost, but with the probability 
of data loss anywhere in the system. For this to 
happen, a node and its r successors must all have 
failed at some point in the ring. 

To determine this probability, we model the
Chord ring as a sequence of N nodes, each of which
is missing replicas with probability p. The probability
of data loss in this model is equivalent to that
of obtaining a run of r consecutive successful outcomes
in N Bernoulli trials with probability of success p.
This is known as the Run Problem, and the general
solution, RUN(p, r, N), can be given in terms of a
generating function [21].

\[
F_r(p, s) = \frac{p^r s^r (1 - ps)}{1 - s + (1 - p)\, p^r s^{r+1}} = \sum_{i=r}^{\infty} c_i^r s^i
\]

\[
RUN(p, r, N) = \sum_{i=r}^{N} c_i^r
\]
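
For numerical work, RUN(p, r, N) can also be evaluated without expanding the generating function. The sketch below uses a standard dynamic programme over the length of the current run instead; it is an alternative technique to the one given above, and the function name run_probability is an assumption of this example.

```python
def run_probability(p, r, n):
    """Probability of at least one run of r consecutive successes in n
    Bernoulli trials with success probability p."""
    if r <= 0:
        return 1.0
    # state[k] = probability that no run of r has occurred yet and the
    # current trailing run of successes has length k (0 <= k < r).
    state = [1.0] + [0.0] * (r - 1)
    hit = 0.0  # probability that a run of r has already occurred
    for _ in range(n):
        nxt = [0.0] * r
        nxt[0] = (1.0 - p) * sum(state)   # a failure resets the run
        for k in range(r - 1):
            nxt[k + 1] = p * state[k]     # a success extends the run
        hit += p * state[r - 1]           # the run reaches length r
        state = nxt
    return hit


print(run_probability(0.5, 2, 3))  # HHT, HHH and THH out of 8 outcomes
```

Like the model in the text, this treats the ring as a linear sequence of N nodes and ignores runs that wrap around the end.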

In order to relate this to maintenance intervals,
we need to determine the proportion of replicas
missing from the ring. We will do this in terms
of the number of maintenance calls in one half-life,
S. A network's half-life is the minimum of the time
taken for N/2 of the nodes to leave and the time
taken for N/2 nodes to join. We will consider only
steady state systems, where nodes join and leave the
network at the same rate.

After one half-life, half of the nodes are newly
joined, and contain no data¹, so we take p = 1/2.
We assume churn occurs at a constant rate, so that
a fraction 1/S of the way through a half-life, p = 1/(2S).

We make the simplifying assumption that data
transfer time is a negligible proportion of the half-life,
so that a maintenance call will instantly return
the system to its ideal state, provided no data
has been lost completely before it runs. Each half-life
can then be divided into S independent maintenance
intervals, during each of which data is lost
with probability RUN(1/(2S), r, N). The overall data
loss probability is the probability that any of these
¹ This is also a simplification. Although most missing
data is caused by empty new nodes, a node also stores data 
on its r successors, and so a failure causes one fewer replica 
to be stored. Taking this into account gives the fraction of 
missing replicas as ^^^g ■ Our approximation is fair for large 
r. 
Figure 1: Minimum repairs necessary to maintain
FAIL{N, r, S) = 10'^ for N = 500 



S maintenance intervals loses data, and so is given 
by: 



\[
FAIL(N, r, S) = 1 - \left(1 - RUN\left(\tfrac{1}{2S}, r, N\right)\right)^{S}
\]



We can use this to determine how often the main- 
tenance algorithm needs to run in order to maintain 
a given failure probability. Figure 1 shows the mini- 
mum maintenance frequency necessary to maintain 
FAIL{500,r,S) = 10"'', where r varies from 4 to 
20. We can see that there is a clear trade-off be- 
tween bandwidth use and storage space. The num- 
ber of maintenance calls necessary drops rapidly as 
we increase the number of replicas. 
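
The search behind such a plot can be sketched in a few lines. This is an illustration only: the helper names are hypothetical, RUN is evaluated by a dynamic programme over the current run length rather than by the generating function, and the target failure probability used below is an arbitrary value chosen for the example, not the one used in the figures.

```python
def run_probability(p, r, n):
    """P(at least one run of r successes in n Bernoulli(p) trials)."""
    state, hit = [1.0] + [0.0] * (r - 1), 0.0
    for _ in range(n):
        hit += p * state[-1]
        state = [(1.0 - p) * sum(state)] + [p * s for s in state[:-1]]
    return hit


def fail(n, r, s):
    """FAIL(N, r, S) = 1 - (1 - RUN(1/(2S), r, N))^S."""
    return 1.0 - (1.0 - run_probability(1.0 / (2 * s), r, n)) ** s


def min_maintenance_calls(n, r, target, s_max=10_000):
    """Smallest S with FAIL(N, r, S) <= target, found by linear search."""
    for s in range(1, s_max + 1):
        if fail(n, r, s) <= target:
            return s
    return None


# The target below is an illustrative value chosen for this example.
for r in (4, 8, 12):
    print(r, min_maintenance_calls(500, r, target=1e-3))
```

For very small targets the expression 1 - (1 - x)^S loses floating-point precision, so a more careful implementation would be needed to reproduce figures at stricter failure probabilities.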

Figure 2 shows how network size affects the level 
of maintenance necessary. N is varied from 50 to 
500 for several values of r. For small r, the net- 
work size has a significant impact on the required 
maintenance frequency. As we increase r, network
size becomes less important in determining main- 
tenance frequency. 

It should be noted that in systems where data 

size is large or half-life is short, so that data transfer 
times become a significant fraction of a half life, 
our assumptions are not valid and more frequent 
maintenance or a larger number of replicas will be
necessary. 
Figure 2: Minimum repairs necessary to maintain
FAIL{N, r, S) = 10"^ in networks of size N = 
50, 100, 200, 500, with r = 6, 8, 10, 12, 15 

5.3 Maintenance and Fetch Latency 

As we allow the number of replicas to drop between
maintenance intervals, we increase the likelihood
that we will need to contact more than one node to
find the data. If we again approximate the probability
of a node not having data as 1/(2S), then the
number of nodes we need to contact, or probe, before
data is found will be geometrically distributed,
with expected value:



\[
E(\text{probes}) = \frac{2S}{2S - 1}
\]
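
For reference, this expectation follows directly from the geometric distribution, under the same approximation that each probed node is missing the data independently with probability 1/(2S). Writing q for the probability that a probe succeeds:

```latex
\[
  q = 1 - \frac{1}{2S}, \qquad
  E(\text{probes}) = \sum_{k \ge 1} k\,(1 - q)^{k-1} q = \frac{1}{q} = \frac{2S}{2S - 1}
\]
```

For example, S = 1 gives 2 expected probes, while S = 4 gives 8/7, or roughly 1.14.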



6 Dynamic Replication 

Replica Enumeration, as proposed in [18], aims to 
remove some of the placement and cardinality re- 
strictions imposed by successor replication, whilst 
preserving addressability and the ability to make 
consistent updates. 

The placement strategy for Replica Enumeration
is based around an allocation function, h(m, d). For
each document with ID d, the replicas are placed
at replica locations determined by h(m, d), where
m ≥ 1 is the index of that instance. The allocation
function is intended to be pseudo-random, so
that the replicas are evenly distributed about the
address space.

The replication cardinality is variable in a fixed
range 1 ≤ r ≤ Rmax, allowing greater replication



for items in greater demand. The mechanism used 
to decide the exact number of replicas is not spec- 
ified, but could be designed to adapt to both the 
reliability of the network, and the load on those 
nodes providing each document. 

To provide addressability, the following invari- 
ants are maintained: 

1. Replicas of an item d are only placed at addresses
given by h(m, d), where 1 ≤ m ≤ Rmax.

2. For any document d in the system, there always
exists an initial replica at h(1, d).

3. Any further replica with m > 1 can only exist
if a replica currently exists for m − 1.

Various strategies for finding data are possible in 
this scheme. One that is generally efficient is to do 
a binary search over the range [1, Rmax] starting
from a randomly selected index in that range. If 
the data is not replicated at a given location, we 
use invariant three to reduce the search range ac- 
cordingly. 

Dynamic replication can help alleviate the 
lookup bottleneck that affects Successor replica- 
tion. Successor replication requires that all lookups 
for a popular key are directed to that key's owner. 
With an appropriate allocation function, dynamic 
replication can place replicas at evenly spread, well 
known, locations around the ring. Lookup queries 
for the owner of a popular item are then distributed 
more evenly. 

6.1 Maintenance algorithm 

A major difficulty with the dynamic replication al- 
gorithm as originally provided is that maintaining 
these invariants in a system with a high churn rate 
is very difficult. Node departures and arrivals could 
cause any of the invariants to be violated. 

It can be shown that lookups will proceed correctly
as long as at least invariant two holds. However,
for replica addressability, invariants one and
three must also hold.

We modify the algorithm given in [18] with a 
maintenance algorithm to allow it to operate cor- 
rectly and reliably in a system with high churn 
rates. 

We will use the following definitions to refer to 
the various roles nodes play in holding replicas: 
1. The node responsible for h(1, d) is the owner
of d.

2. The replica group for an item d is the set of
nodes whose keyspace includes a replica location
from the set {h(m, d) : 1 ≤ m ≤ Rmax}.

3. The core group for an item d is the set of
replica holders for which m ≤ Rmin.

4. The peripheral group is the set of replica holders
for which Rmin < m ≤ Rmax.

Since we cannot rely on any single node being
available in an unreliable environment, we must
modify our invariants:

1. Replicas of an item d can only be retrieved
from addresses given by h(m, d), where 1 ≤ m ≤ Rmax.

2. For any document d in the system, there exists
with high probability a replica in the core
group.

3. Any peripheral replica with m > Rmin can
only be retrieved for a single maintenance interval
if no replica currently exists for m − 1.

We can now give three maintenance protocols 
which maintain these invariants under churn. 

Core Maintenance The owner of a data item 
calculates and looks up the nodes in the core 
group for the data range it is responsible for. 
For each core replica holder, the owner and the 
replica holder synchronize databases over the 
part of the owner's keyspace which is mapped into
that replica holder's keyspace. 

Core maintenance must also deal with alloca- 
tion collisions, as described in the next section. 

Peripheral Maintenance In order to maintain 
Invariant 3, a node which stores a replica with 
index m > Rmin must check that a replica 
of that item is also held on the replica prede- 
cessor, the owner of the location with index 
m — 1. 

For each peripheral replica a node holds, it 
must obtain a summary of the items with 
the previous index on the replica predecessor. 
Bloom filters[13] can be used to reduce the cost 
of these exchanges. 



These summaries can be used to remove or- 
phaned peripheral replicas from the system. 
Orphaned peripheral replicas should not be 
used to answer fetch requests, but should still 
be stored for at least one maintenance inter- 
val, as simulation shows maintenance often re- 
places the missing replica. 

Global Maintenance Each node calculates the 

replica group for each item it holds. Any items 
it is no longer a replica holder for are offered 
to their owner, then deleted. 

We cache lookups made during maintenance to
reduce bandwidth costs. Cache validity is checked
at regular intervals during maintenance.

This algorithm attempts to restore the system to 
its ideal state each time it is run. However, between 
runs, the system is rarely in its ideal state. Thus 
we must ensure the system operates correctly where 
the invariants do not hold. Invariant two is suffi- 
cient to ensure lookups proceed correctly, though 
less efficiently, as nodes fail[18]. 

In order to update information, we must discover 
how many peripheral replicas are currently in the 
system. To be completely certain of consistency, we 
must offer all updates to all owners of peripheral 
replica locations. 

If a small probability of temporary inconsistency
is acceptable, we can improve performance
by offering updates only until we encounter a certain
number of empty peripheral replica locations
since, by invariant three, it becomes increasingly
unlikely that any further locations are occupied.
This could dramatically improve performance if
Rmax ≫ Rmin.

6.2 Allocation Collisions 

It is important that the images of a node's key-range
under h(m, d) are owned by different nodes
for each index m. In some cases, however, the allocation
function will map two replicas into the keyspace
of the same node. We call this an allocation collision.
Each allocation collision reduces the number
of nodes in the core replica group, reducing reliability.

The core maintenance algorithm must keep track
of which nodes have been allocated replicas from
which key ranges. If an allocation collision occurs,
the core maintenance algorithm must instead place
the collided keyspace in the peripheral group,
effectively extending the core group. This means
we must choose Rmax so that sufficient peripheral
locations are available to recover from all allocation
collisions with high probability. To do this, we must
understand how nodes are distributed around the
ring.

Since we have N nodes uniformly distributed
throughout a keyspace of size K, the probability
of a node being at a given ID is N/K. Therefore, the
number of keys between nodes is geometrically distributed
with p = N/K. Using standard properties of
a geometric distribution [20], we can find the mean
and variance of this distribution.



\[
\mu = \frac{K - N}{N} \approx \frac{K}{N} \quad (\text{since } K \gg N)
\]

\[
\sigma^2 = \frac{K(K - N)}{N^2} \approx \frac{K^2}{N^2} \quad (\text{since } K \gg N)
\]

By the central limit theorem [19], the space occupied
by r nodes will be approximately normally distributed with

\[
\mu_r = \frac{rK}{N}, \qquad \sigma_r^2 = \frac{rK^2}{N^2}
\]

And so, by standard properties of the normal
distribution, 95% of the time the keyspace between
r nodes will be less than $(r + 1.645\sqrt{r})\,K/N$ keys in
length. To allow r replicas to be stored in 95% of
cases, the range of available replica locations should
be at least this size.
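
As a worked example of this bound, the helper below computes the required range for given r, N and K. It is a sketch only: the function name replica_range and the example values are assumptions, and it relies on the normal approximation with mean rK/N and standard deviation sqrt(r)K/N.

```python
import math


def replica_range(r, n_nodes, keyspace, z=1.645):
    """Keyspace length that covers r node gaps with one-sided quantile z.

    Uses the normal approximation with mean r*K/N and standard deviation
    sqrt(r)*K/N for the space occupied by r nodes.
    """
    gap = keyspace / n_nodes  # mean distance between adjacent nodes
    return (r + z * math.sqrt(r)) * gap


# Illustrative values: 200 nodes on a 2**32 keyspace, r = 9 replicas.
span = replica_range(9, 200, 2 ** 32)
print(span / (2 ** 32 / 200))  # the range expressed in mean node gaps
```

Expressing the result in mean node gaps makes it directly comparable with the number of replica indexes that must be available.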

6.3 Dynamic fetch algorithm 

The dynamic fetch algorithm needs to choose which
indexes to look up and in which order. In many
cases, Algorithm 2 will give good performance.

In situations where load balancing is not critical, 
shorter fetch times can be attained by searching the 
core replica locations before trying peripheral ones. 

If maintenance is infrequent, eliminating the en- 
tire range of peripheral replicas with higher indexes 



Algorithm 2 Dynamic Fetch for key

    indexes ← [0 … Rmax]
    item ← NULL
    while ¬item do
        index ← indexes.popRandom()
        item ← recursiveGet(key, h(index, key))
        if ¬item and index > Rmin then
            indexes.removeRange(index, Rmax)
        end if
        if indexes = [] then
            indexes ← [0 … Rmax]
        end if
    end while



if one peripheral replica is found empty may re- 
sult in poor performance, and simply removing the 
replica known to be empty may be preferable. 
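
The pruning strategy of the dynamic fetch can be sketched in Python as follows. Several details are assumptions of this sketch rather than part of the algorithm as given: indexes run from 1 to Rmax to match the invariants, the number of probes is bounded by a hypothetical max_probes parameter so that a missing item terminates, and the lookup callable stands in for a recursive get at h(index, key).

```python
import random


def dynamic_fetch(key, lookup, r_min, r_max, rng=random, max_probes=100):
    """Probe replica indexes in random order, pruning higher peripheral
    indexes when a peripheral location (index > r_min) is found empty.

    `lookup(index, key)` stands in for recursiveGet(key, h(index, key)).
    """
    indexes = list(range(1, r_max + 1))
    for _ in range(max_probes):  # bounded, unlike the pseudocode
        index = indexes.pop(rng.randrange(len(indexes)))
        item = lookup(index, key)
        if item is not None:
            return item
        if index > r_min:
            # Invariant three: nothing should exist above an empty slot.
            indexes = [i for i in indexes if i < index]
        if not indexes:
            indexes = list(range(1, r_max + 1))
    return None


# Replicas exist at indexes 1..3 out of a possible 6 (illustrative).
store = {(m, "d"): "doc" for m in (1, 2, 3)}
probe = lambda index, key: store.get((index, key))
print(dynamic_fetch("d", probe, r_min=2, r_max=6))
```

Replacing the list comprehension with removal of the single failed index gives the more conservative variant suggested for systems with infrequent maintenance.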



6.4 Recursive data lookup 

In order to increase performance when looking up 
data, we use algorithm 3 to perform recursive gets. 
This combines the Chord lookup and get messages, 
which allows any node on the lookup path of a re- 
quest to return a replica of the requested data, if it 
holds one. This avoids further lookup hops, reduc- 
ing fetch latency. 

Algorithm 3 Recursive get for key at location = h(m, key) for some m

    if self.keyrange.containsReplica(key) and self.store.contains(key) then
        return self.store.get(key)
    end if
    if self.keyrange.contains(location) then
        return NULL
    end if
    next ← self.closestPrecedingNode²(location)
    return next.recursiveGet(key, location)



Recursive data lookup can interfere with load
balancing, since some replicas are passed queries
more often than others. To prevent this, an overloaded
node may choose to forward a recursive get
request rather than answer using its own replica.
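
The effect of answering from any replica on the lookup path can be seen in a toy model. Everything here is an assumption made for illustration: the Node class, the fixed next_hop pointer standing in for Chord's closestPrecedingNode routing, the three-node ring, and the simplification that any stored copy counts as a valid replica.

```python
class Node:
    """Toy node on a ring of positions 0..K-1 (an assumption of this sketch)."""

    def __init__(self, node_id, pred_id, ring_size):
        self.id, self.pred, self.k = node_id, pred_id, ring_size
        self.store = {}       # key -> data held as a replica
        self.next_hop = None  # stand-in for Chord's closestPrecedingNode

    def owns(self, location):
        """True if location lies in (pred, id] going clockwise."""
        span = (self.id - self.pred) % self.k
        return 0 < (location - self.pred) % self.k <= span

    def recursive_get(self, key, location, hops=0):
        if key in self.store:    # any node on the path may answer
            return self.store[key], hops
        if self.owns(location):  # the responsible node has no replica
            return None, hops
        return self.next_hop.recursive_get(key, location, hops + 1)


a, b, c = Node(10, 30, 32), Node(20, 10, 32), Node(30, 20, 32)
a.next_hop, b.next_hop, c.next_hop = b, c, a
c.store["x"] = "data"  # replica at the node responsible for location 28
b.store["x"] = "data"  # replica on a node the request passes through
print(a.recursive_get("x", 28))  # answered by b, one hop early
```

Removing the copy held by b makes the same request travel one hop further before it is answered by c.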
6.5 Allocation Functions

The choice of allocation function is critical to maintenance
performance. For each item d that a node
owns, it must look up and contact every node in the
core replica group in order to run the core maintenance
protocol.

For this to be scalable, we must ensure that
as many of these lookups as possible can be satisfied
with little network communication. This requires
that the allocation function maps one node's data
onto a limited number of replica holders.

We suggest that for a given m, h(m, d) is a translation
in d. This means that the image of one
node's key-range is another continuous linear range
of the same size. Since Chord nodes are evenly distributed
throughout the key space, an image of one
node's key-range is owned by O(1) other nodes.

We will now give four allocation functions and explore
how they impact reliability and performance.
All these functions make use of N, the number of
nodes in the system. This value may either be supplied
by the user, or estimated at run time [2].

6.5.1 Successor Allocation 

\[
h(m, d) = \left(d + m \cdot \frac{K}{N}\right) \bmod K
\]

Attempting to map replicas onto successors is
efficient, as the Chord protocol maintains a list
of each node's successors on that node, so lookups
can often be performed without consulting another
node.

Because replica locations with different indices
are relatively close together under this allocation
function, we expect some allocation collisions,
which must result in the creation of peripheral
replicas. Using the result from section 6.2,
we recommend that Rmax − Rmin is at least
$1.645\sqrt{R_{min}}$, so that Rmin distinct replicas can
be stored in 95% of cases. This consideration also
applies to the predecessor and block allocation
functions.

6.5.2 Predecessor Allocation 
\[
h(m, d) = \left(d - m \cdot \frac{K}{N}\right) \bmod K
\]



² closestPrecedingNode as described in [11]



Because queries are routed around the ring clockwise
towards the node responsible for them, a
lookup for one node is frequently routed through
one of its predecessors.

Predecessor allocation aims to exploit this fact to 
reduce lookup latency. When a request for a docu- 
ment is routed through one of the replica holders, 
the recursive get algorithm allows them to satisfy 
the request before it ever reaches the node respon- 
sible, reducing the fetch latency by one or more 
hops. 
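
Both translation-style allocation functions can be sketched in a few lines. The step size used here, one mean node gap of K/N per replica index, is an assumption of this sketch, as are the function names and the example ring parameters.

```python
def successor_allocation(m, d, k, n):
    """Replica location m mean node gaps clockwise of d (assumed step K/N)."""
    return (d + m * (k // n)) % k


def predecessor_allocation(m, d, k, n):
    """Replica location m mean node gaps anticlockwise of d (assumed step K/N)."""
    return (d - m * (k // n)) % k


K, N = 2 ** 16, 64  # illustrative ring size and node count
d = 1000
print([successor_allocation(m, d, K, N) for m in range(1, 4)])
print([predecessor_allocation(m, d, K, N) for m in range(1, 4)])
```

Because each function is a translation in d for fixed m, shifting d by some offset shifts every replica location by the same offset, so a node's contiguous key-range maps onto a contiguous range of the same size.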

6.5.3 Block Allocation 

\[
h(m, d) = \left(d - \left(d \bmod \frac{R_{max} K}{N}\right) + \left(d \bmod \frac{K}{N}\right) + m \cdot \frac{K}{N}\right) \bmod K
\]

This allocation function attempts to make replica
groups overlap entirely with core replica groups of
other nearby keyranges. As will be seen in the next
section, this policy provides a lower probability of
data loss than other placement functions.

It also provides most of the benefits of both 
successor and predecessor replication, since most 
nodes will have replicas placed on both successors 
and predecessors. 

This allocation function is discontinuous in d,
and the maintenance algorithm must be able to deal
with this when mapping ranges of keys onto other
nodes.

6.5.4 Finger Allocation 

h{m, d) = {d + 2('"+'')) mod K 

This allocation function again takes advantage of 
the information already maintained by the chord 
algorithm. Chord maintains routing information 
about nodes at fractional distances around the ring, 
called fingers. Placing replicas on these finger nodes 
reduces the number of lookups that must be made, 
and distributes replicas evenly around the ring. 

Figure 3: Probability of failure in one half-life for
three allocation functions under 50% failure. Pre- 
decessor allocation is not shown as its performance 
is equivalent to successor. Error bars show 95% 
confidence intervals. 

Because of the distance between replica loca- 
tions, allocation collisions are rare, and Rmax may 
be set more conservatively than for other allocation 
functions. 

6.6 Allocation functions and Relia- 
bility 

The allocation function chosen has a significant im- 
pact on reliability. Block allocation, in which core 
replica groups for one data range overlap entirely 
with core groups for nearby data ranges, produces 
only a very few core replica groups. Finger allo- 
cation produces core replica groups which overlap 
very little with other replica groups, producing a 
large number of distinct groups. 

This large number of distinct groups leads to a 
higher probability that any one of them will fail. 
We produced a simple model of a 500 node net- 
work, in which 250 nodes are marked as failed. We 
produced 10^ sample networks with this model, and 
used them to estimate the probability of any data 
loss occurring in the network with varying numbers 
of replicas. Figure 3 shows the probability of data 
loss for finger, block and successor allocation func- 
tions. Block allocation is able to achieve a more 
reliable system with a smaller number of replicas. 
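
A simplified version of this kind of experiment can be sketched as follows. It is not the model used to produce Figure 3: replica groups are expressed directly as sets of node indexes rather than derived from allocation functions, block placement is approximated by disjoint blocks of r adjacent nodes, and the sample count, seed and value of r are illustrative choices.

```python
import random


def loss_probability(groups, n_nodes, n_failed, samples, rng):
    """Estimate P(some replica group lies entirely inside the failed set)."""
    losses = 0
    for _ in range(samples):
        failed = set(rng.sample(range(n_nodes), n_failed))
        if any(all(node in failed for node in g) for g in groups):
            losses += 1
    return losses / samples


def successor_groups(n_nodes, r):
    """Every node plus its next r - 1 neighbours: N overlapping groups."""
    return [[(i + j) % n_nodes for j in range(r)] for i in range(n_nodes)]


def block_groups(n_nodes, r):
    """Disjoint blocks of r adjacent nodes: roughly N / r groups."""
    return [list(range(i, min(i + r, n_nodes))) for i in range(0, n_nodes, r)]


rng = random.Random(1)
for name, make in (("successor", successor_groups), ("block", block_groups)):
    p = loss_probability(make(500, 10), 500, 250, samples=1000, rng=rng)
    print(name, p)
```

With fewer distinct groups there are fewer ways for an entire group to fall inside the failed set, which is the effect the comparison between block and successor allocation relies on.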

We also simulated random replica placement, 
where replicas were placed entirely randomly using 
Figure 4: Quantity of data lost in failures for Successor,
Random and Finger allocation functions.
Error bars show 95% confidence levels.

a pseudo random number generator. The results
for random placement are almost identical to those
for finger placement.

In Figure 4, we compare the quantity of data lost,
given that a failure occurs. When a rare failure does
occur, more data is lost with block allocation than
with other allocation functions. In many applications,
however, any quantity of data loss would
be considered catastrophic.

7 Simulation 

We now attempt to quantitatively compare
the performance and bandwidth usage of these
replication algorithms. Due to the difficulty of
managing large numbers of physical nodes [6], we
chose to test the algorithms through simulation
rather than through deployment.

Our simulator is based around the SimPy[12] 
discrete-event simulation framework, which uses 
generator functions rather than full threads to 
achieve scalability. The simulator implements a 
message level model of a Chord network running 
each of the replication algorithms described. 
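
The generator-based style can be sketched without any simulation library. The following is a minimal illustration, not SimPy's API and not the simulator used in this paper: the Scheduler class, the churn process and the half-hour replacement delay are all assumptions made for the example. Each process is a generator that yields the delay until it next wants to run, so many processes can be simulated without one thread each.

```python
import heapq
from itertools import count

class Scheduler:
    """Tiny discrete-event loop; processes are generators yielding delays."""

    def __init__(self):
        self.now = 0.0
        self._queue = []       # heap of (wake_time, sequence, process)
        self._seq = count()    # tie-breaker so generators are never compared

    def spawn(self, process, delay=0.0):
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), process))

    def run(self, until):
        while self._queue and self._queue[0][0] <= until:
            self.now, _, process = heapq.heappop(self._queue)
            try:
                delay = next(process)   # resume until the next yield
            except StopIteration:
                continue                # process finished
            self.spawn(process, delay)
        self.now = until

def churn(sched, log, interval=24.0, replace_delay=0.5):
    """Hypothetical churn process: one failure per interval (hours),
    followed shortly afterwards by a replacement node joining."""
    node = 0
    yield interval
    while True:
        log.append((sched.now, "fail", node))
        yield replace_delay
        log.append((sched.now, "join", node))
        node += 1
        yield interval - replace_delay

sched = Scheduler()
log = []
sched.spawn(churn(sched, log))
sched.run(until=60.0)
for entry in log:
    print(entry)
```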

7.1 Simulation Parameters 

In our simulation, we chose parameters that might 
resemble a data center built from cheap commodity 
components. While the simulator is capable of 
running thousands of nodes, we have limited it to 
two hundred in most scenarios. This was in order 
to keep runtime reasonable for the large number of 
scenarios and algorithms we wish to test, and the 
need to repeat simulations to estimate errors. 

Figure 5: Get times in various system sizes. 
(Successor and Block overlap). 

We simulate a steady state system in which a 
node fails every 24 hours, shortly after which a new 
node replaces it. Latency between nodes is assumed 
to be uniform, and bandwidth is assumed to be 
unlimited - messages always take the same time to 
deliver regardless of size. 

We chose fixed parameters for the Chord algorithm 
in all simulations, with a successor list length 
of 10 and a finger table size of 12. Chord repair is 
carried out at thirty minute intervals. 

We also configure the GET algorithm to search 
the core replica group before trying the peripheral 
replicas. Local and Core Maintenance algorithms 
run two passes at each maintenance interval. 

Failures are detected by timeouts, which are set 
to 3 hops for round trip communications. Recursive 
lookup timeouts are based on network size, and are 
set to 15 hops in a 200 node network. A shorter 
timeout could have been chosen, leading to shorter 
average lookup times, as failed lookups are detected 
more quickly. Short timeouts also increase band- 
width usage, however, as long-running lookups are 
reissued before they complete. 

The system is simulated for one complete half 
life, during which 50,000 sample data fetches are 
made for randomly chosen data items, and fetch 
times are logged. Bandwidth usage is also logged 
by type, allowing separation of maintenance 
messages from Chord repair messages. Each simulation 
is repeated 4 times to obtain a good estimate of the 
average latencies and bandwidth usage. 

Figure 6: Fetch times in a 200 node system, varying 
S, the number of times the maintenance algorithm 
runs per half life. (Successor and Block overlap). 



8 Simulation Results 
8.1 Fetch Latency 

The fetch latency each maintenance algorithm can 
achieve depends on the network size, the frequency 
with which it is run, and the number of replicas in 
the system. 

Figure 6 shows how fetch times scale with main- 
tenance frequency. The Finger and DHash algo- 
rithms both scale as predicted in section 5.3. 

Maintenance frequency has less effect on Successor, 
Predecessor and Block fetch times. The proximity 
of different replica indexes means that other 
replica holders often preemptively return data, so 
that it is less important that the specified replica 
is present. 

The predecessor algorithm achieves the shortest 
fetch times. With predecessor allocation, queries 
for core replicas are more often routed through 
peripheral replicas, which return the data 
preemptively. DHash fetches are never returned 
preemptively, and are routed through the requesting 
node, so that they take at least one hop longer than 
dynamic allocation. 

Figure 5 shows how fetch times scale with network 
size. All algorithms show logarithmic behavior, 
though finger allocation compares increasingly 
badly to the other dynamic algorithms since, as 
network size grows, the probability of a preemptive 
return drops. 
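
The logarithmic trend can be reproduced in a deliberately idealised setting. The sketch below is not the simulator's routing code. It assumes a fully populated identifier ring with 2^bits nodes, fingers at every power-of-two distance, no failures and no preemptive returns. The function names are hypothetical.

```python
def greedy_hops(src, dst, bits):
    """Hops taken by greedy finger routing on a full ring of 2**bits nodes."""
    size = 1 << bits
    cur, hops = src, 0
    while cur != dst:
        dist = (dst - cur) % size
        # Follow the largest finger that does not overshoot the target.
        step = 1 << (dist.bit_length() - 1)
        cur = (cur + step) % size
        hops += 1
    return hops

def mean_hops(bits):
    """Average hop count from node 0 to every node on the ring."""
    size = 1 << bits
    return sum(greedy_hops(0, d, bits) for d in range(size)) / size

for bits in (4, 6, 8, 10):
    print(f"{1 << bits:5d} nodes: {mean_hops(bits):.1f} hops on average")
```

Each hop clears the highest set bit of the remaining distance, so the mean hop count in this idealised ring is bits / 2, which grows with the logarithm of the number of nodes.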

Increasing the number of replicas reduces the 
lookup times slightly, whereas increasing the num- 
ber of distinct data items in the system has no im- 
pact on lookup times. 

8.2 Maintenance Bandwidth Costs 

Figure 7: Overhead bandwidth in a 200 node system 
with varying numbers of repairs. 

Maintenance bandwidth can be divided into two 
separate costs: The cost of identifying which data 
should be stored where, and the cost of moving the 
data to that location. We refer to the former as the 
maintenance overhead. 

In Figure 7 we can see how overhead varies with 
maintenance frequency. DHash maintenance has 
lower overhead than the dynamic algorithms. 
Dynamic maintenance typically involves O(r) lookups, 
where DHash requires O(1). The dynamic algorithms 
all have similar overhead, which increases 
linearly with maintenance frequency. Notably, the 
overhead bandwidth of all algorithms is so small as 
to be negligible in most network environments. 

Figure 8: Proportion of data in system transmitted 
[...] overlap). Horizontal axis: Maintenance Calls 
Per Half Life. Series: Predecessor, DHash, Finger, 
Block, Successor. 

Data movement bandwidth is likely to be the 
bottleneck in distributed storage systems. Figure 
8 shows all dynamic algorithms move very similar 
quantities of data. At high maintenance levels, 
significantly more data is moved with DHash than 
the dynamic algorithms, since a single node joining 
produces changes in the membership of r nearby 
replica groups, with one node leaving each group. 
This causes the node expelled from each of these 
replica groups to send any replicas it no longer owns 
to their owner. 
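
The effect of a single join on nearby replica groups can be checked with a small sketch. It assumes the successor-list placement usually associated with DHash, in which a key is stored on its successor node and the r - 1 nodes that follow it. The ring, the key space and the function names are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left

def replica_group(nodes, key, r):
    """Successor of `key` plus the next r - 1 nodes on a sorted ring."""
    i = bisect_left(nodes, key) % len(nodes)
    return tuple(nodes[(i + j) % len(nodes)] for j in range(r))

def groups_changed_by_join(nodes, new_node, r, keyspace):
    """Map each old replica group that changes to the node(s) it expels."""
    before = sorted(nodes)
    after = sorted(before + [new_node])
    expelled = {}
    for key in range(keyspace):
        old = replica_group(before, key, r)
        new = replica_group(after, key, r)
        if set(old) != set(new):
            expelled[old] = set(old) - set(new)
    return expelled

ring = list(range(0, 100, 10))   # ten nodes with IDs 0, 10, ..., 90
changes = groups_changed_by_join(ring, new_node=45, r=3, keyspace=100)
for group, gone in sorted(changes.items()):
    print(group, "expels", sorted(gone))
```

With r = 3, the join of node 45 alters three existing groups, and each of them expels exactly one member, which then holds replicas it no longer owns.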

Figure 9 shows the linear relationship between 
the number of key value pairs and maintenance 
overhead. Again, maintenance overhead is low for 
all algorithms, so that it should be feasible to store 
very large numbers of items per node without 
maintenance bandwidth becoming a bottleneck. 

Per node maintenance overhead bandwidth and 
total data movement remain constant as we vary 
the number of nodes in the system, for all algo- 
rithms. This means all algorithms will scale well to 
very large numbers of nodes. 

8.3 Fault Tolerance 

We have so far investigated the performance of 
these data replication algorithms under a steady 
state of churn, in which new nodes join at the same 
rate as other nodes fail. The DHT can also recover 
from far higher failure rates, although there is a 
substantial performance impact. 

We simulate a simultaneous failure of a varying 
proportion of the nodes in the Chord network, and 
then launch 50,000 data fetches immediately 
afterward. Unlimited retries are allowed, and the 
average fetch time, including retries, is shown in 
Figure 10. 

The DHash algorithm is particularly affected in 
this scenario, due to its reliance on reaching a single 
node. The dynamic algorithms are more resilient 
to faulty routing, as they may select multiple in- 
dexes to look up, and because preemptive returns 
are possible even when the specific node requested 
is unreachable. 
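
A rough feel for why several candidate indexes help can be had from a simple probability sketch. It assumes that each lookup attempt reaches its target independently with the same probability, which real routing failures do not satisfy, so the numbers are illustrative only. The function names are hypothetical.

```python
def p_any_reachable(p_single, k):
    """Chance that at least one of k independent lookups succeeds."""
    return 1.0 - (1.0 - p_single) ** k

def expected_attempts(p_success):
    """Mean number of tries when retrying until the first success."""
    return 1.0 / p_success

for k in (1, 2, 3):
    p = p_any_reachable(0.5, k)
    print(f"{k} index(es): success {p:.3f}, "
          f"about {expected_attempts(p):.2f} attempts per fetch")
```

Under these assumptions, a fetch that can try three indexes succeeds on a given attempt far more often than one that depends on a single node, and so needs fewer retries on average.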



9 Related Work 

Metadata based algorithms, such as the Version ID 
system used by OceanStore[8], are another possible 
replica management option. These algorithms do 
not have placement restrictions, but instead use a 
metadata item to locate replicas. These solutions 
may perform well in many scenarios, but require 
an underlying reliable storage system for metadata. 
The algorithms we have described could be used to 
provide such a storage system. 

Soft-state storage is another common replica 
management system. With soft-state storage, no 
attempt is made to maintain replicas. Instead, data 
expires after some timeout and the system relies on 
data periodically being refreshed and reinserted by 
some external system. Though this can be useful 
for frequently refreshed data, failure of the external 
storage system can cause unwanted data loss. 

Figure 10: Data fetch times after a simultaneous 
failure of varying proportions of nodes. (Block, 
Successor and Predecessor overlap) 

10 Conclusions 

We have used a combination of analysis and sim- 
ulation to assess the ability of various replication 
algorithms to meet the goals we set out in section 
3. 

We can see that dynamic replication can achieve 
faster lookups and greater reliability, and may 
require less replica movement than the DHash 
algorithm, with only a slightly higher maintenance 
overhead. 
We have also shown how the allocation function 
choice can have a dramatic impact on performance. 
Of the allocation functions we considered, block al- 
location provides the best reliability and represents 
a good compromise for most systems, though pre- 
decessor placement might be preferable if perfor- 
mance is critical. 

Possible drawbacks of dynamic replication are 
its slightly higher maintenance bandwidth usage 
and its reliance on an even distribution of node 
IDs, which may make it unsuitable for small Chord 
rings. 

System size scalability is good for all maintenance 
algorithms. Lookup times scale with the 
logarithm of network size and total system 
bandwidth consumption scales linearly with the number 
of nodes. 

On an Internet-wide scale, bandwidth and uptime 
are likely to be more limited than in the data center 
scenarios we have considered. In such a system, the 
number of nodes is likely to vary throughout the 
day rather than remain constant. Although our 
work may provide insight into performance in such 
scenarios, further work needs to be done to assess 
the reliability of a system which incorporates user 
desktop systems. 